Skip to content

Conversation

@MasterPtato
Copy link
Contributor

@MasterPtato MasterPtato commented Jan 12, 2026

Fixes RVT-5331

Copy link
Contributor Author

MasterPtato commented Jan 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@linear
Copy link

linear bot commented Jan 12, 2026

@claude
Copy link

claude bot commented Jan 12, 2026

PR Review: Add Metrics to Epoxy

This PR adds comprehensive metrics to the Epoxy consensus system. Overall, the implementation is well-structured and follows good practices. Here are my findings:

✅ Strengths

  1. Comprehensive coverage: The metrics cover all critical aspects of the consensus protocol - proposals, pre-accept/accept/commit operations, quorum attempts, replica state, and errors.

  2. Good metric naming: Follows Prometheus naming conventions with clear, descriptive names prefixed with epoxy_.

  3. Appropriate metric types: Uses the right metric types for each use case:

    • Counters for cumulative events (proposals, accepts, commits, errors)
    • Gauges for current state (ballot numbers, instance numbers, replica counts)
    • Histograms for duration tracking (proposal duration, request duration)
  4. Structured logging compliance: The timing instrumentation using Instant is clean and consistent.

  5. Error tracking: Good coverage of error conditions (ballot rejections, interference detection, invalid ballots).

⚠️ Issues & Concerns

1. Potential Issue in Proposal Metrics (ops/propose.rs:93-100)

The proposal metrics are recorded after the operation completes, which is correct for duration. However, the status label logic has a subtle issue:

metrics::PROPOSALS_TOTAL
    .with_label_values(&[if let ProposalResult::Committed = res {
        "ok"
    } else {
        "err"
    }])
    .inc();

Issue: ProposalResult::CommandError is treated as "err", but command errors are not consensus failures - they represent successful proposals where the command couldn't be applied (e.g., CAS mismatch). This conflates two different concepts.

Recommendation: Use a more granular status:

  • "committed" - Successfully committed
  • "consensus_failed" - Consensus protocol failed
  • "command_error" - Proposal succeeded but command failed

2. Missing Metric Recording on Early Return Paths

In ops/propose.rs, if an error occurs during the consensus process (lines 73-90), the metrics at the end won't be recorded because the ? operator returns early. This means failed proposals won't be counted if they fail with an error rather than returning ProposalResult::ConsensusFailed.

Recommendation: Consider using a defer-like pattern or ensuring metrics are recorded even on error paths.

3. Inconsistent Error Labeling in message_request.rs:35-37

.with_label_values(&[request_type, if res.is_ok() { "ok" } else { "err" }])

This lumps all errors together without distinguishing the error type. For a consensus system, it would be valuable to know if errors are due to:

  • Network timeouts
  • Ballot rejections
  • Invalid state
  • Internal errors

Recommendation: Consider more granular error labels or at least separate user/system errors.

4. REPLICAS_TOTAL.reset() in update_config.rs:15 - Potential Data Loss

This is a critical concern:

metrics::REPLICAS_TOTAL.reset();
for replica in &update_config_req.config.replicas {
    metrics::REPLICAS_TOTAL
        .with_label_values(&[...])
        .inc();
}

Issues:

  • reset() clears all label combinations, not just the ones about to be set
  • If there's a prometheus scrape between reset() and the loop completion, you'll get inconsistent data
  • In multi-replica scenarios, this could reset metrics from other replicas if they share the same registry

Recommendation:

  • Either use set() instead of reset() + inc() pattern
  • Or track the previous state and only update changed values
  • Document that this assumes single-replica-per-process deployment

5. Missing Documentation in metrics.rs

While the metric descriptions are present, there's no module-level documentation explaining:

  • What these metrics measure in the context of EPaxos
  • Expected patterns (e.g., what's a healthy accept/reject ratio?)
  • How to use them for troubleshooting

Recommendation: Add module-level documentation with examples.

6. No Tests

The PR adds 262 lines of metric instrumentation but includes no tests verifying:

  • Metrics are incremented correctly
  • Labels are applied properly
  • Edge cases (e.g., what happens when ballot is rejected?)

Recommendation: Add integration tests that verify key metrics are recorded correctly.

📊 Performance Considerations

  1. Metric overhead: The current implementation is efficient - using label lookups and simple increments. The histogram buckets from BUCKETS should be reviewed to ensure they match the expected latency distribution of epoxy operations.

  2. Label cardinality: All labels use a fixed, small set of values (good practice). No unbounded cardinality issues detected.

  3. String allocation in http_client.rs:97: quorum_type.to_string().as_str() creates a temporary string on every quorum attempt. Consider using a match statement to return &'static str directly.

🔒 Security Considerations

No security issues detected. Metrics don't expose sensitive data.

📝 Code Style & Best Practices

  1. Import organization: ✅ Imports are at the top of files as per CLAUDE.md guidelines
  2. Hard tabs: ✅ Will be enforced by rustfmt
  3. Structured logging: ✅ Already uses structured logging patterns
  4. Error handling: ✅ Appropriate use of anyhow Result types

🎯 Recommendations Summary

High Priority:

  1. Fix the REPLICAS_TOTAL.reset() race condition
  2. Add tests for metric recording
  3. Fix proposal status labels to distinguish command errors from consensus failures

Medium Priority:
4. Handle metrics on error paths in epoxy_propose
5. Add more granular error labels for request types
6. Optimize string allocation in quorum type conversion

Low Priority:
7. Add module documentation for metrics
8. Review histogram bucket configuration

Verdict

This is a solid metrics implementation that provides good observability into the Epoxy consensus system. The main concerns are around the reset() race condition and the need for test coverage. With those addressed, this will be a valuable addition to the codebase.

Suggested Action: Request changes to address the high-priority items before merging.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 12, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3847

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3847

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3847

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3847

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3847

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3847

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3847

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3847

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3847

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3847

commit: b007d1a

@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 9d56f3c to 816b2fc Compare January 13, 2026 00:27
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 485634a to 964ae0f Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 02463d5 to 2ae3899 Compare January 13, 2026 01:01
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 964ae0f to f3c17ef Compare January 13, 2026 01:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2ae3899 to 5d200c7 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from f3c17ef to b5999a3 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 5d200c7 to 69795ba Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 27fc351 to 77119ca Compare January 14, 2026 02:53
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 7c91643 to 40274fa Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 77119ca to 1fe301c Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 40274fa to cb0fad3 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch 2 times, most recently from 830c32d to dfbb13a Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from cb0fad3 to 7a06ae3 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from dfbb13a to 81663b7 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch 2 times, most recently from 5109a19 to 2e32f50 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 81663b7 to 260900c Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from 2e32f50 to b007d1a Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from 260900c to a4f42a6 Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-chore_clean_up_errors branch from a4f42a6 to 520df1b Compare January 14, 2026 23:39
@MasterPtato MasterPtato force-pushed the 01-12-fix_epoxy_add_metrics branch from b007d1a to 3cc8cea Compare January 14, 2026 23:39
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app graphite-app bot closed this Jan 14, 2026
@graphite-app graphite-app bot deleted the 01-12-fix_epoxy_add_metrics branch January 14, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants